Modeling of microarray data with zippering 
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The ability of oligonucleotide microarrays to measure gene expression has been hindered by an 
imperfect understanding of the relationship between input RNA concentrations and output signals. 
We argue that this relationship can be understood based on the underlying statistical mechanics of 
these devices. We present a model that includes the relevant interactions between the molecules. 
Our model for the first time accounts for partially zippered probe-target hybrids in a physically 
realistic manner, and also includes target-target binding in solution. Large segments of the target 
molecules are not bound to the probes, often in an asymmetric pattern, emphasizing the impor- 
tance of modeling zippering properly. The resultant fit between the model and training data using 
optimized parameters is excellent, and it also does well at predicting test data. 

PACS numbers: 81.16.Fg, 82.35. Pq, 87.14Gg, 87.15Aa, 87.15Cc, 87.80.Tq 
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Oligonucleotide microarrays have had a profound im- 
pact on medical diagnosis and molecular biology. These 
devices have thousands of cells, each containing numer- 
ous copies of a specific DNA probe attached to the sub- 
strate. After amplification, cRNA (or DNA) derived 
from a biological sample is fluorescently labeled and then 
fragmented into targets that are allowed to bind to the 
probes. The targets hybridize with their complementary 
DNA probes with high specificity. The fluorescence al- 
lows an optical readout of the concentration of thousands 
of RNA transcripts simultaneously. 

There are often orders of magnitude difference in the 
concentrations of the various kinds of targets and many 
important genes are expressed only in very small concen- 
trations. These can be hard to measure by this method 
for reasons we now discuss. 

In Affymetrix GeneChips, 16 kinds of probe, all with 
the same concentration and probing different segments 
of a single niRNA, are employed for every transcript. 
There can be order of magnitude variations in the signal 
intensities between these 16 probes. The specific signal 
intensities are reproducible and so cannot be explained 
as statistical error. Instead, they occur because the in- 
teractions between the probes and the target molecules 
are complex. In addition to the binding between a probe 
and its complementary target (specific binding), there 
are binding of targets to noncomplementary probes (non- 
specific binding) and target-target binding. In addition, 
even when a target binds to its complementary probe, it 
is possible that it is only partially hybridized, with the 
ends unzipped. This is likely to be an important effect 
in microarrays, since their operating temperature is close 
to the melting temperature of the hybrids, so that sub- 
stantial fluctuations in binding can be expected. 

Various statistical techniques have been used to re- 
duce the errors caused by these effects. For example 
Affymetrix, in addition to using 16 probes for every tran- 
script, has added another set of "mismatch" probes that 
differ from the original "perfect match" probes by the 



alteration of the middle nucleotide. These are compared 
to reduce the error from nonspecific binding. Although 
these techniques reduce the uncertainties in predicted in- 
put concentrations, by analyzing 32 numbers, they still 
have difficulty in detecting small concentrations of RNA 
at biologically significant levels. Alternative statistical 
approaches have been proposed [l|. 

As recognized by several previous authors [2, 0, 3 , the 
statistical techniques used by Affymetrix and others do 
not utilize the probe sequence information. The hope is 
that including these should greatly increase the reliability 
of predictions. 

Held et al Q modeled the binding of each target 
molecule to its complementary probe using a Langmuir 
adsorption model 5] , If c is the concentration in solution 
of the target, the fraction of probe molecules that are 
hybridized is 
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1 + exp[/3(AG - /i)] 



(1) 



where exp(/3/i) ex c. The binding energy AG [g was 
obtained from previously measured stacking free ener- 
gies 0, ■ AH target-probe pairs were treated indepen- 
dently, with nonspecific binding included phcnomcnolog- 
ically by adding a AG dependent constant to the mea- 
sured signal, to be of the form a + 6cxp[cAG]. Par- 
tially hybridized probes were not considered, although 
the authors comment that they might be of impor- 
tance. Perhaps because of this, in their Boltzmann fac- 
tors ^ exp(/3AG), the best fit effective temperature was 
approximately seven times the actual temperature. 

Hekstra et al 3] also used a Langmuir adsorption 
model with an additive background from nonspecific 
binding. However, the resultant three parameters for 
each probe-target pair were not evaluated using previ- 
ously determined stacking energies, but were fitted to 
linear combinations of the number of each nucleotide. 

Zhang et al [Jl attempted to include the effects of par- 
tial binding of target molecules to probes. The model 



they came up with was the "positional dependent near- 
est neighbor model" (PDNN) where the binding energy 
for a probe-target pair was taken to be of the form 
AG = ^j, Wfce(6fe, 6fc+i), where e is the stacking energy 
for the adjacent bases bk and 6fc+i, and {oJk} was a set 
of weights depending only on the position on the probe 
k, and not on the specific probe molecule. The probe 
intensities were assumed to be linear in the target con- 
centrations, i.e. saturation effects were not included. 

The above methods incorporating sequence informa- 
tion have different approaches to data fitting, with vari- 
ous physical effects included. In this paper we present 
a comprehensive approach that includes what we be- 
lieve to be all the most important effects, to construct 
a physical model for microarrays. In particular we have 
included "zippering effects" , i.e. target molecules par- 
tially hybridized to probes, by using a full statistical me- 
chanical approach rather than ad-hoc position-dependent 
weights. This reduces the number of fitting parameters 
enormously, because partial binding can be understood 
completely as a consequence of the stacking energies. We 
also include the effect of target-target binding in solution. 
As pointed out by Held et al 2|, the saturation intensi- 
ties of different probes that correspond to target frag- 
ments from the same mRNA molecule can be different 
by an order of magnitude. It is likely that this is because 
of these target target interactions, reducing the effective 
concentration in solution of different targets by differ- 
ent amounts. Since RNA-RNA interactions are stronger 
than RNA-DNA 0], this effect can be substantial. We 
also include non-specific binding for each probe similarly. 

We present our model in three parts: specific bind- 
ing, including zippering, non-specific binding, and finally 
target-target interactions in solution. 

Consider a target molecule consisting of a sequence 
of bases {61 . . . Bn} (where A^ = 25 for the Affymetrix 
GeneChip) . According to the stacking energy description 
of hybridization, if this target molecule is fully bound to 
its complementary probe, the resultant change in free 
energy is of the form 
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AG(l,7V)=^e(6fc,6fc+i) 



(2) 



fc=i 



where e^ is the initiation energy of attachment. However, 
it is also possible for the target to be partially bound to 
the probe, with only the bases n to m being bound. This 
configuration would have a AG(n, m) given by Eq.|(21) but 
with the sum from k — ntok — m—1. Because the local 
stiffness is large, and the target-probe hybrid is in a heli- 
cal structure, we only need to consider configurations for 
which the unbound parts start at the ends, rather than 
forming isolated islands in the middle. Thus a target- 
probe pair can be viewed as a double-ended zipper. 

The resultant partition function for the bound state 
that includes all partially bound configurations is 

Z = Y^ exp(-/3AG(n, m)) = exp(-/3AG) (3) 



where AG is the total binding free energy. Naively, 
this takes 0{N^) operations to compute, which is pro- 
hibitively expensive, because AG is computed repeat- 
edly when optimizing the model. However, Z can be 
computed in 0{N) operations using recursion relations. 
Define Z{i) as the analog of Z in Eq.(|3Jl, but with only 
the bases from 1 to i included. Z(i) is the sum of two 
terms: Zu{i), which considers configurations that are un- 
bound at the site i (but have a bound segment somewhere 
before i), and Zi,{i), which considers configurations that 
are bound at the site i. The recursion relations for these 



are 



Z„(i + f) = Zuii) + Zb{i) 

Zb{i+1) = Zfc(i)exp(-/3e(6„6,+i))-hexp(-/3eO.(4) 



In the first of these equations, configurations that are 
attached at i can detach at i + 1. However, in the second 
equation, because we allow only one bound segment in 
our zipper model, and configurations in Zu{i) have al- 
ready had a bound segment before the base i, there is no 
contribution from Z„. These recursion relations are are 
a simplification of those in Ref. [lOl [Ij ■ The AG from 
Eq.(|3Jl can now be used in Eq.Q. 

The next effect we consider is nonspecific binding. In 
principle, this could be accomplished with an approach 
very similar to the one we have constructed for specific 
binding. This would require a complete knowledge of 
all molecular fragments present in the solution; these in- 
clude a background of human RNA, and the additional 
"spiked-in" target molecules that the experiment tries 
to measure. It would also be necessary to determine 
the stacking energies for all 4* possible mismatched (or 
matched) sequences of two base pairs. Since neither of 
these is fully known, we use a statistical approach to 
model non-specific binding. We assume that the RNA 
giving rise to non-specific binding is sufficiently diverse 
that it can be treated as a 'bath' of random sequences. 
We also make the approximation that stacking energies 
can be defined in terms of the nearest neighbors on the 
probe. If {bk} is the probe sequence and {ck} is a (non- 
complementary) target sequence, this approximation is 
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-I3e'{bk,bk + i) 
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where e' is an effective stacking energy. In the absence 
of experimental knowledge of the 256 stacking energies, 
not using this approximation would result in so many 
adjustable parameters in our model so as to make any 
results meaningless. 

Finally, we include target-target binding in solution. 
As mentioned earlier, there are substantial variation in 
the saturated probe intensities for different target frag- 
ments from a single mRNA molecule. Further, the in- 
tensities saturate at a much higher concentration of tar- 
get molecules than one would expect from the number 



of probe molecules in a single cell [ij- This strongly 
suggests that the target concentration in solution is sig- 
nificantly depleted, by an amount that differs from one 
target species to another. We model the fraction of any 
target species that is lost to target-target binding by us- 
ing the analogs of Eq.l^. For reasons similar to those 
for non-specific binding 14], we model this statistically. 

We performed numerical simulations using the model 
described above using Affymetrix's Series 1532 Latin 
Square [13 data. In these experiments, a cocktail of hu- 
man cRNA was "spiked in" to a background of human 
RNA of unknown composition. In any experiment, the 
transcripts had concentrations 0, 0.25, 0.5 . . . 1024 pM. 
The transcripts whose concentration was large varied 
from one experiment to another cyclically, in a pattern 
that formed a Latin Square matrix. Each cRNA tran- 
script could hybridize with 16 different probe sequences 
at different positions along the original transcript. 

In the simulations, we varied the parameters of the 
model in order to minimize the fitness, the log mean 
square difference between the measured signal intensities 
and the model predictions: 



F = ^ [In /„,,„, - In Ipred?/M 



(6) 



where the sum runs over probes and experiments, and M 
is the total number of data points used. We followed the 
common practice of using the logarithm in this definition, 
because the intensities vary over orders of magnitude, 
and doing otherwise would discount low concentration 
transcripts, which are important biologically. The data 
from probes 407_at and 36889_at was not used, follow- 
ing Affymetrix's recommendation. The minimization of 
F as a function of model parameters is a computation- 
ally intensive optimization problem. In order to increase 
convergence to the solution, we used several techniques, 
of which parallel tempering [ig was found to work best. 

The parameters in the model were 16 stacking ener- 
gies and one initiation energy for specific, non-specific 
and target-target binding (51 parameters), the number of 
probe molecules for each species, a scale factor convert- 
ing from hybridization to signal intensity, and a (small) 
uniform background to the signal. The total number of 
parameters was 54. Although DNA-RNA stacking ener- 
gies for matched bases are known Q, we allowed these 
to vary as free parameters. This was because the addi- 
tion of fluorescent tags to the RNA molecules has been 
shown [13 to change the stacking energies. The final op- 
timized stacking energies were of the same order of mag- 
nitude as experimental results for untagged molecules 9] . 

Although 54 fitting parameters might seem to be a 
large number, the model was used to fit 2464 data points, 
and our number of parameters compares favorably with 
prior work |j]. Importantly, when we randomly shuffled 
the sequences associated with the different probes and 
redid the optimization, the function F in Eq. @ increased 
by a factor of more than two, indicating that our results 
were not an artifact of having too many parameters. 
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FIG. 1: Plot of the signal intensities as measured (solid line) 
and from the model (dashed line) for all the probes (except 
for those corresponding to three transcripts) for a typical ex- 
periment. The input concentration of each transcript is also 
shown (not on the same scale); since there are sixteen probes 
for each transcript, this is a piecewise constant curve. 




FIG. 2: Measured (solid line) and predicted (dashed line) 
intensities for probes for transcript 37777_at from the 
AfFymetrix Latin Square data. This transcript was not used 
for model optimization. Results of a similar procedure with 
the "PDNN" model Q are also shown (dotted line). 



Apart from being able to fit the given data accurately, 
the purpose of a model is to be able to analyze new exper- 
iments with unknown concentrations of transcripts, and 
accurately predict these concentrations. Accordingly, we 
left out some some (one or three) of the transcripts during 
parameter estimation (in addition to transcripts 407_at 
and 36889_at mentioned earlier). After this data had 
been used to train the model, we tried to predict the 
concentrations of the transcripts that had been left out. 

Figure ^ shows the measured signal intensities for all 
the probes in a typical experiment, except for those corre- 
sponding to three transcripts: one to be used in the sub- 
sequent prediction stage, and the two that were known to 
be unreliable. The figure also shows the signal intensities 
from the model, with optimized parameters, and the in- 
put concentrations. The model reproduces the measured 
intensities quite well. The model parameters were opti- 
mized once for all the experiments simultaneously. The 
residual error F, defined in Eq.®, was 0.19. 

With the optimal parameters obtained above, the 
model was used to predict the signal intensities for the 
probes corresponding to the transcript that was left out 
in the training stage, as shown in Figure [3 The same 
figure shows results (with the same procedure) using the 
PDNN model of Ref. |j|. As seen in the figure, the pre- 
dictions with our model are much better. 

We also show the prediction capabilities of the model 
with a different procedure, which is similar to the one 
used by Ref. [2|. Three transcripts were left out (in 
addition to the two faulty ones) in the training stage. 
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FIG. 3: Predicted input concentrations as a function of the 
actual input concentrations for three transcripts. These were 
not included in the parameter optimization of the model. 



In the prediction stage, for each experiment, the sixteen 
measured probe intensities were used collectively to pre- 
dict the input concentration of each of the three excluded 
transcripts 'lq|. The results are shown in Figure 13 

When target-target interactions were omitted from our 
model, the residual error F increased to 0.27. However, 
the significance of this is hard to interpret, since the 
number of parameters is thereby reduced from 54 to 37. 
Therefore, we assessed this reduced model for prediction. 
We found a noticeable degradation in predictive power at 
low input concentrations compared to Figure |21 
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FIG. 4: Fraction of time a base is unbound as a function 
of position (from the solution end) along the probe, for 3 
different probes. 



Lastly, in Figure^jwe present some data on partial zip- 
pering. The figure displays the fraction of the time a base 
pair is unbound as a function of position on the probe. 
As is evident, these probes are only partially bound. The 
places they bind are not symmetric about the middle, and 
depend strongly on the probe sequence. This kind of be- 
havior differs from that of previous models. Evidence for 
this kind of asymmetric partial binding can be seen from 
careful experiments on Agilent microarrays pL9| . 

In this paper, we have constructed a physical model for 
hybridization of RNA transcript targets to probes in mi- 
croarrays, in order to predict experimental signal inten- 
sities. Our model includes for the first time the effect of 
partial hybridization of probes (zippering) derived from 
fundamental statistical mechanics, and also the effect of 
target-target interactions. The prediction capabilities of 
our model appear to surpass those of other approaches. 
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